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^ Abstract 

t*^ In genetics it is often of interest to discover single nucleotide poly- 

morphisms (SNPs) that arc directly related to a disease, rather than 
just being associated with it. Few methods exist, however, addressing 
this so-called 'true sparsity recovery' issue. In a thorough simulation 
study, we show that for moderate or low correlation between predictors, 
lasso-based methods perform well at true sparsity recovery, despite 

^_^ not being specifically designed for this purpose. For large correlations, 

H" however, more specialised methods are needed. Stability selection and 

. ^H direct effect testing perform well in all situations, including when the 

j^ correlation is large. 

Cu Keywords: Direct effects, Fine mapping, Large p, Lasso, True sparsity se- 

lection. 

1 Introduction 

In genetic studies where the density of measured single nucleotide poly- 
morphisms (SNPs) is high, it is of interest to recover SNPs that are directly 
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affecting a response. A SNP with a direct effect is defined as one that has an 
effect on the response not caused by its correlation with any other measured 
SNP. Unfortunately, there can be many SNPs associated with the response 
due to high linkage disequilibrium (LD) or correlation with a SNP with a 
direct effect. Define the 'true sparsity pattern' to be the set of SNPs (or 
more generally predictors) that have a direct effect on the response. This 
paper compares the ability of various methods to recover the true sparsity 
pattern. 

Consider a study with n participants typed at p SNPs. The SNPs and 
the response are assumed to be related through 

g{E[Y]) = r, = X(3 + S, (1) 

where rj is an n x 1 response vector explained by an n x p matrix X (the 
SNPs) through an unknown pxl coefficient vector (3 with n x 1 noise vector 



S. The response of interest is Y and g{-) is the link function (see McCullagh 



and Nelder, 1989). We are interested in a sparse solution where only some 



{3j 7^ 0. For a review of sparsity methods see Fan and Lv (2010). SNPs 
with direct effects are difficult to find when the predictors are correlated or 
p > n. In these situations there may exist f3' and /3" such that Xf3' = XP" 



but /3' 7^ /3", causing an identifiability problem (e.g. Candes and Tao, 2007). 
It can therefore be difficult to identify which of a highly correlated group of 
predictors possesses a direct effect. 

We consider the case where the predictors and response are binary. SNPs 
can be expressed in binary format despite having three levels by coding the 
dominant effect and recessive effect in two separate predictors. This allows 
us to capture dominant, recessive and additive effects where the latter is 
represented as significance in both the dominant and recessive effect pre- 
dictors. In genetic studies, the responses are also frequently binary as, for 
example, in case control studies. 

In this paper we compare the lasso, screen and clean, stability selection, 
direct effect testing, and the elastic net; these methods are briefiy described 
in Section [2] Of interest is their ability to recover true sparsity patterns 
in simulations. The methods are compared against Fisher's exact test, the 
standard test for binary association. Section [3] describes the simulations and 
presents the results. A comparison on real data is carried out in Section |4} 
we conclude in Section [H 



2 Sparsity Methods 



2.1 Lasso 



The lasso is a regression technique that performs variable selection (Tibshi 



ram 



1996). Estimation is done by minimising a penalised residual sum of 



squares function, 



P 



argmin {Y - X/3f + A V |/?j | , 
/3 ^^ 



i=i 



where A is a tuning parameter. If A = 0, the lasso is identical to ordinary 
least squares while a larger value of A will cause some coefficients to be 
zero, which corresponds to selection, and the remaining coefficients to be 
shrunk towards zero. The tuning parameter A can be selected through cross 
validation or standard model selection criteria. 

While lasso regression can be computed efficiently using the least angle 



regression algorithm (LARS) of Efron et al. (2004) it suffers from difficulty 



assigning significance to predictors. Meinshausen (2008) describes this nO' 



tion as ill-posed; it may be obvious that an effect is present, but significance 
is lost in trying to assign the effect to a specific predictor. This leads to the 
naive inclusion of all those predictors with nonzero coefficients in the ffiial 
sparse model. 



2.2 Screen and Clean 



Screen and clean (Wasserman and Roeder, 2009) is an extension to the 



lasso that uses a two stage approach to overcome two issues with the lasso: 
first, the difficulty of significance testing in the presence of a large number 
of potentially correlated predictors, and second, the tendency of the lasso 



to underestimate effect sizes (e.g. Radchenko and James, 2009). The first 
'screen' stage fits a lasso regression to half of the data, producing a collection 
of predictors. The second 'clean' stage fits ordinary least squares regression 
to the other half of the data, using the predictors carried forward from the 
first stage. This reduced set of coefficients makes significance testing feasible, 
and avoids effect sizes being underestimated. Significance testing is carried 
out with a Bonferroni correction on the reduced set. 



2.3 Stability Selection 



Stability selection (Meinshausen and Biihlmann 2010) tackles the problem 



of significance testing for predictors by resampling. Each resampled dataset 



is generated by selecting half of the observations at random (without re- 
placement). For each subsample lasso regression is fit, and the predictors 
with nonzero coefficients are recorded as those selected. Let rrij be the count 
of times the j^^ predictor is nonzero. For each predictor Xj, the proportion 
of times that predictor is present in the fitted model is given by ttj = rrij/B. 
Predictor Xj is included in the final model if vr-,- > vrthr , where vrthr is a tun- 
ing parameter. The set of predictors recovered is insensitive to the choice 



of TTthr; for TTthr £ (0.6,0.9) ( Meiushauseu and Biihlmann 2010). Note that 



other subset selection method then the lasse can be considered; we use lasso 
regression for comparability with the other methods. 

2.4 Elastic Net 



The elastic net (Zou and Hastie, 2005) combines lasso and ridge regression 



(Hoerl and Kennard, 1988). Like the lasso, it minimises a penalised residual 



sum of squares function, but there is an additional quadratic penalty term: 

p p 

= argmin {Y - Xf3f + Aa J^ |/3j| + A(l - a) J] /3|. (3) 

^ i=i i=i 

As a consequence, amongst a set of highly correlated predictors, the elastic 
net tends to include either all or none of those predictors in the model. This 
is in contrast to the lasso, which tends to select a single predictor from a 
group of highly correlated predictors that are correlated with the response. 

2.5 Direct Effect Testing 



Direct effect testing (DET) (Sperrin and Jaki, 2010) detects direct effects. 



and calculates the probabilities that each predictor is the true origin of each 
direct effect. The first stage of DET identifies direct effects by using lasso 
regression to separate direct effects from indirect effects. To calculate the 
significance of an effect, each direct effect is attributed, automatically by the 
lasso, to a specific predictor. The second stage of DET then incorporates 
the uncertainty in the predictor that is the true origin of each direct effect, 
by giving, for each predictor Xk on which an effect is observed. 



Pr[Xj true effect [Effect observed in stage one on X, 



fc • 



for a collection of predictors Xj in high correlation with X^. 

DET therefore provides a method to carry out fine mapping by distin- 
guishing between direct and indirect effects, and quantifying the uncertainty 
associated with this distinction. 



2.6 Pre-Screening Methods 

In our simulations, we restrict the number of predictors to numbers in the 
thousands. Genetic data, however, frequently involve several hundred thou- 



sand predictors. Fan and Lv (2008) suggest an initial step in which the 



dimensionality is reduced to a more manageable level using "sure indepen- 
dence screening". This is followed by application of the other methods 
described in this section. In our investigations we will assume that, when 
appropriate, some dimension reduction approach has been applied first to 
reduce the number of predictors. 

3 Design and Results of Simulations 

A range of simulations are carried out to study the properties of the meth- 
ods introduced above. Situations involving strong and weak correlations, 
including both serially correlated and clustered data, are considered and 
consistency properties of the methods on these simulated data are investi- 
gated. We begin by explaining how the various methods are applied, how 
significance is determined, and introduce some of the measures used to com- 
pare the methods. 

For the lasso, the penalty is selected using cross validation; a find is 
recorded when a predictor receives a nonzero coefficient. For screen and 
clean, the lasso penalty is selected by cross validation and significance is 
determined using a Bonferroni correction. The results for stability selec- 
tion are based on 100 resamples with the lasso penalty determined by cross 
validation and the significance threshold set at TTthr = 0.75. We calculate 
elastic net solutions for three values of a: 0.9, 0.75 and 0.5. The penalty 
term A is chosen via cross validation. Only a = 0.5 is displayed in the pro- 
ceeding simulations. As a — >• 1 the elastic net solution becomes identical to 
the lasso, hence elastic net solutions with larger a fall between the elastic 
net solutions displayed, and the lasso solutions. For DET, significance of 
each predictor is calculated using a Bonferroni correction. For stage two, let 
Pmax be the probability of the best contender for the origin of a given direct 
effect, then a 'find' is defined as any other predictor that has a probability 
of origin that is at least 10% as large as the best contender, i.e. O.lpmax- We 
also include results for Fisher's exact test, with a Bonferroni correction, as 
a baseline comparison. 

The comparisons are not designed to be an indicator of which methods 
are universally best. Each method is designed to deal with different sce- 
narios, and here only the ability of the methods to recover true sparsity 



patterns is considered. Moreover, it is difficult to make the comparisons 
fair, for example DET has multiple chances to make a find per significant 
effect identified, while formal significance testing is not used in declaring 
significance for the lasso. 

The response for each individual observation, Yi, is generated according 
to Pr[l^ = 1] = /ij, where 

so that the response is related to a subset of K predictors with indices 
{JIt ■ ■ iJk}- The intercept a is chosen to provide approximately equal 
numbers of observations with Yi = and 1^ = 1; in this way we intend to 
replicate a case-control study. The /3jj, 's are chosen to represent large effect 
sizes so that the differences between methods can be most easily seen; we use 
Pjk = 0.81 and f3jk = 0.405. These correspond to an absolute risk increase 
of 20% and 10% respectively, in the case control framework. The influential 
predictor(s) are chosen at random on each simulation. 

We have generated data in a reasonably simple fashion: for example by 
considering ffist-order auto-regressive predictors. More realistic correlation 
patterns for genetic data can be generated using real LD patterns. However, 
this does not give us the flexibility to change the extent of the correlation 
that we get from the methods used here. Simulations based around real 
datasets which therefore contained more realistic LD patterns where also 
undertaken. As the results were qualitatively the same as those under the 
simpler situations they are not presented. 

If a method finds a predictor that was truly used to generate the re- 
sponse, this is called a 'true find'. Any other predictors found are declared 
'false finds'. A 'strong false find' is recorded whenever at least one incor- 
rect predictor is included in the model; a 'strong true find' means all the 
causal predictors are recovered. A 'perfect find' means the model includes 
all causal predictors and no other predictors (i.e. occurrence of a strong true 
find and non-occurrence of a strong false find). A standard definition of 
false discovery rate (FDR), FDR = To"toi^number'of finds ' is also used. These 
definitions are summarised in Table [TJ 

Since in genetics it is often of interest to detect predictors in high LD 
with a causal variant, we also carry out simulations in which the definition of 
a 'true find' is relaxed so that a true predictor is declared to have been found 
provided any predictor with a correlation of at least 0.9 to the true predictor 
has been found. We denote the results corresponding to this definition the 
HCT (highly correlated with the true predictor) cases. These results are 



Table 1 : Summary of find definitions 



Name 


Description 


True find 


True predictor found 


Strong true find 


All true predictors found 


False find 


False predictor found 


Strong false find 


At least one false predictor found 


Perfect find 


Strong true find and no strong false find 


FDR 


False discovery rate (false finds / total finds) 



only included for some cases, as the omitted scenarios were qualitatively 
the same. To enhance readibility, in all proceeding Figures, false finds are 
displayed on a log scale, whilst the remaining measures are displayed on a 
linear scale. 



The 'glmnet' function in the 'glmnet' library (Friedman et al. , 2011) in 



R ( R Development Core Team , 2008 ) was used to calculate lasso and elastic 



net solutions. All other code was written by the authors. 



3.1 Serially Correlated Data 

To generate serially correlated data, denoting the correlation by p, the ap- 
proach used is: 

1. Generate random binary realisations for Xj^i, i = 1, . . . ,n, according 
to a Bernoulli(0.5) distribution, where n, the sample size, is set to 
n = 1000 here. 

2. For j = 2, . . . ,p, where p is the number of predictors, set to p = 400 



here, generate binary realisations for X. 



«j' 



1, 



, n, via 



X 



«j 



Xij-.i with prob. p, 

with prob. {I- p)/2, 

1 with prob. (1 — p)/2. 



3. Randomly assign a causal predictor with coefficient 0.81, repeating 
this ten times on each dataset. 

4. Repeat steps 1~3 100 times to obtain 100 different datasets, and hence 
1000 simulations in total. 

This procedure is repeated for a range of serial correlations, from p = to 
p = 0.99. 



Figure [T] illustrates the simulated true finds, perfect finds, false finds 
and FDRs for serial correlations ranging from to 1. Most of the methods 
deteriorate in fairly similar ways as the correlation approaches one. The 
exception is DET, whose perfect find rate and FDR do not become worse 
as the correlation becomes very large. Indeed, once the correlation becomes 
close to one, DET has the highest number of perfect finds. This is a con- 
sequence of the second stage of DET accounting for the high correlation by 
giving a large set of potential predictors for each detected effect. It is only 
once p > 0.75, however, that including stage two of DET gives additional 
benefits over stage one alone (DET stage one not shown). For small to mod- 
erate correlation, stability selection gives the highest perfect find rate and 
lowest FDR. Figure [T] also demonstrates poor false find control of the elastic 
net and Fisher test, although the Fisher test, unsurprisingly, copes well at 
low correlations. 

Figure [2] illustrates the HCT case. Stability selection benefits the most 
from this relaxation of the definition of a find: the deterioration in its accu- 
racy for large correlation is prevented. Lasso, the elastic net and the Fisher 
test do not appear to benefit. 

To study consistency, the same data generation as above was used and 
the correlation fixed while the sample size, n, varied. Figure [3] gives true 
finds, perfect finds, false finds and FDRs for the methods with high corre- 
lation, p = 0.95. Figure |4] gives the same for correlation p = 0.5. In both 
cases the number of predictors p = 400. 

Screen and clean has the best consistency properties in the high corre- 
lation scenario, since the perfect find rate increases, and the false discovery 
rate decreases as the sample size increases. Stability selection also performs 
well in the high correlation scenario since the perfect find rate increases, and 
the false discovery rate is controlled (although it does not appear to decrease 
as fast when n increases). DET performs poorly as sample size increases in 
the large correlation case; a large number of false finds are generated for 
large n, and the perfect find rate does not increase. For lasso, elastic net 
and the Fisher test, there appears to be no gain in having a sample size 
larger than 1000. 

For the low correlation case (Figure H|, all the methods considered have 
good consistency properties. DET outperforms the other methods in terms 
of perfect finds and FDR for small n; stability selection and screen and clean 
are the better performers on these measures for larger sample sizes n. 
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Figure 1: Serial correlation: Comparing true finds (top left), perfect finds 
(top right), false finds (bottom left) and false discovery rate (bottom right) 
for different values of p. Grey line: Fisher test, black line: DET, blue line: 
lasso, light blue line: elastic net, green line: screen and clean, purple line: 
stability selection. 
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Figure 2: Serial correlation (HCT case): Comparing true finds (top left), 
perfect finds (top right), false finds (bottom left) and false discovery rate 
(bottom right) for different values of p. Grey line: Fisher test, black line: 
DET, blue line: lasso, light blue line: elastic net, green line: screen and 
clean, purple line: stability selection. 
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Figure 3: Comparing true finds (top left), perfect finds (top riglit), false finds 
(bottom left) and false discovery rate (bottom right) for differing sample 
sizes n, on the log scale. Serial correlation p = 0.9. Grey line: Fisher test, 
black line: DET, blue line: lasso, light blue line: elastic net, green line: 
screen and clean, purple line: stability selection. 
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Figure 4: Comparing true finds (top left), perfect finds (top riglit), false finds 
(bottom left) and false discovery rate (bottom right) for differing sample 
sizes n, on the log scale. Serial correlation p = 0.5. Grey line: Fisher test, 
black line: DET, blue line: lasso, light blue line: elastic net, green line: 
screen and clean, purple line: stability selection. 
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3.2 Clustered Data 

Clustered data is generated with p = 400 predictors, divided into clusters of 
size k, for a range of cluster sizes from fc = 1 to A; = 10. When necessary extra 
predictors are added to the last cluster to obtain the desired total number of 
predictors. Within cluster correlation is set to p and while different clusters 
are independent. The data generation procedure is as follows: 

1. Generate random binary realisations for Xij, i = 1, . . . , n, where the 
sample size is set to n = 1000, according to a Bernoulli(0.5) distri- 
bution for the first predictor Xj in each cluster. Hence this step is 
carried out p/k times. 

2. For each subsequent Xj in each cluster, generate binary realisations 
for Xij, i = 1, . . . ,n via 

{Xij-i with prob. p, 

with prob. (1 - p)/2, 
1 with prob. (1 - p)/2. 

Hence this is carried out k — 1 times in each cluster. 

3. Randomly assign a causal predictor with coefficient 0.81, repeating 
this ten times on each dataset. 

4. Repeat steps 1-3 100 times to obtain 100 different datasets, and hence 
1000 simulations in total. 

Figure [5] visualises the results of various cluster sizes, for a within cluster 
correlation p = 0.9. Stability selection controls the FDR and maintains a 
large number of perfect finds, out-performing all the other methods. Cluster 
size has little effect on the performance of lasso and elastic net, but a large 
detrimental effect on the performance of the Fisher test. 

Figures [6] repeats the cluster size study but within cluster correlation is 
set to p = 0.95. The larger correlation causes stability selection to struggle 
to make perfect finds, but it does continue to control the FDR. DET now 
achieves the largest number of perfect finds, and the second best FDR con- 
trol. Comparing Figures [5] and |6] shows that increasing the correlation has 
little effect on the performance of the lasso, elastic net or Fisher test. 

Consistency of the various methods was also considered in the cluster- 
ing framework. The results are qualitatively similar to those in the serial 
correlation framework, hence are omitted. 
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Figure 5: Comparing true finds (top left), perfect finds (top riglit), false finds 
(bottom left) and false discovery rate (bottom right) for different values of 
cluster sizes, within cluster correlation 0.9. Grey line: Fisher test, black 
line: DET, blue line: lasso, light blue line: elastic net, green line: screen 
and clean, purple line: stability selection. 
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Figure 6: Comparing true finds (top left), perfect finds (top riglit), false finds 
(bottom left) and false discovery rate (bottom right) for different values of 
cluster sizes, within cluster correlation 0.95. Grey line: Fisher test, black 
line: DET, blue line: lasso, light blue line: elastic net, green line: screen 
and clean, purple line: stability selection. 
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Table 2: SNPs in the abacavir data that were declared true finds by the 
various methods considered. 



SNP 


DET 


Lasso 


stab 
Scl 


s & c 


Elastic 
net 


Fisher 


98D 








X 






133D 








X 






150D 












X 


229D 


X 


X 


X 


X 


X 


X 


229R. 












X 


235D 




X 








X 


331R 












X 


334D 




X 






X 


X 


335R 








X 


X 


X 


336D 


X 


X 


X 




X 


X 


340D 












X 


344R 








X 






349D 












X 


39 IR 




X 








X 


392R 




X 








X 


406D 












X 


440D 












X 


440R 












X 


443D 








X 


X 


X 


443R 




X 




X 


X 


X 


488R 








X 






489D 










X 


X 


489R 












X 


529D 




X 










613D 












X 


642D 








X 






738D 




X 






X 


X 


757D 












X 


758D 




X 






X 


X 


811D 












X 



4 Real Data Example 

We compare the methods on a dataset arising from a case-control study 
investigating whether there are SNPs that act as biomarkers for hypersensi- 



tivity (yes or no) to abacavir, a treatment for HIV (Kelly et al. , 2006). We 
will use the combined data from two global trials ( Hetherington et al. , 2002) 
in our investigation. 

The initial dataset consisted of 856 SNPs that have been arbitrarily num- 
bered here and are spread across the 22 autosomal chromosomes. To obtain 
binary data, we separate the dominant and recessive effect of each SNP into 
two predictors and then remove any binary predictor whose prevalence of 
Os or Is is below 5% giving p = 1373 binary predictors for our evaluations. 
The methods are applied and the SNPs that have been declared 'finds' are 
recorded in Table [H 'Perfect finds' cannot be considered here as the true 
relationship is unknown. 

The most conservative methods are DET and stability selection, both of 
which find the same two SNPs significant — the dominant effects on SNP 
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229 and SNP 336. The same two SNPs are identified with ah other methods 
except screen and clean. There is large agreement between the methods 
on which SNPs to include, although some methods detect more SNPs than 
others. Screen and clean is the exception: it identifies SNPs that are not 
found with other methods, not even the Fisher test which identifies by far 
the most SNPs (24). Only four out of the nine SNPs found with screen and 
clean are identified with any of the other methods. 

5 Discussion 

This paper presents a thorough simulation study of the performance of var- 
ious methods at recovering true sparsity patterns, specifically with appli- 
cation in genetic studies. Many of these methods are not designed for the 
objective of true sparsity recovery but still performed well in recovering true 
predictors. For lasso, elastic net and the Fisher test, this is at the expense 
of a large number of false finds; stability selection and DET achieve the 
true finds without too many false finds. Once the correlation becomes high, 
specialised methods such as DET are needed. The main drawback of DET, 
however, is that performance does not improve notably with sample size. 
In our simulations screen and clean does benefits most from larger sample 
sizes. 

Stability selection performed extremely well. At first glance, it seems 
that because of the bootstrapping approach used, even moderate correla- 
tion may cause problems for the method. Whilst it does fail for very large 
correlations, datasets with neighbouring correlations as high as p = 0.95 are 
handled well. Indeed, until the correlation reaches this high level, stabil- 



ity selection also controls the type one error. Meinshausen and Biihlmann 



(2010) proved type one error control for exchangeable coefficients, the work 
here represents further empirical evidence that similar control is achieved 
in many non-exchangeable situations. Indeed, when the definition of a true 
find is relaxed to declaration of a predictor with correlation of at least 0.9 
with the true predictor, stability selection performs best in all scenarios. 

As the density of information collected increases, measured predictors 
will become increasingly correlated. So methods that are specifically de- 
signed to handle this, such as direct effect testing, will become more im- 
portant. DET, however, only deals with binary predictors and response 
and therefore, novel methods that can handle highly correlated continuous 
variables are needed. 
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